The corresponding journal entry for this assignment is available on GitHub at this link: https://github.com/bcb420-2022/Jedid_Ahn/wiki/Entry-%2311:-Notes-from-lecture-for-A3
# Libraries to install here.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
BiocManager::install("GEOquery")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
BiocManager::install("limma")
}
if (!requireNamespace("circlize", quietly = TRUE)) {
install.packages("circlize")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
if (!requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages("gprofiler2")
}
# Libraries to load here.
library(Biobase)
library(GEOquery)
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gprofiler2)
# Figure count: Start at 1.
count <- 1
The GEO dataset that I chose was GSE66261, which examines the expression and functions of long noncoding RNAs during human T helper cell differentiation.
GEO <- "GSE66261"
count_folder_path <- paste0(getwd(), "/", GEO)
if (!dir.exists(count_folder_path)){
count_file_path <- rownames(GEOquery::getGEOSuppFiles(GEO))[1]
} else{
count_file_path <- list.files(path = count_folder_path, full.names = TRUE)[1]
}
GSE66261 <- read.delim(count_file_path, header = TRUE, check.names = FALSE)
To better understand the data, groups were defined by formatting according to library, cell type, and condition. There are 2 libraries: 2695 and 2960, 2 cell types: Primary and effector, and 3 conditions: TH1, TH2, and TH17.
listed_titles <- colnames(GSE66261)[3:14]
gsm_titles <- c("TH1_Primary_2695", "TH2_Primary_2695", "TH17_Primary_2695", "TH1_Effector_2695", "TH2_Effector_2695", "TH17_Effector_2695", "TH1_Primary_2960", "TH2_Primary_2960", "TH17_Primary_2960", "TH1_Effector_2960", "TH2_Effector_2960", "TH17_Effector_2960")
exp_names <- data.frame(listed_titles, gsm_titles)
samples <- data.frame(lapply(exp_names$gsm_titles,
function(x) { rev(unlist(strsplit(x, split = "_"))) }))
colnames(samples) <- listed_titles
rownames(samples) <- c("library", "cell_type", "condition")
samples <- data.frame(t(samples))
Before normalization, the dataset was filtered by removing genes with low counts. Using edgeR protocol, features without at least 1 read per million in n of the samples were removed. Since there are 6 samples of each group, n = 6.
n <- 6
# Translate out counts into counts per million using the edgeR package.
cpms = edgeR::cpm(GSE66261[, 3:14])
rownames(cpms) <- GSE66261[, 1]
# Get rid of low counts
keep = rowSums(cpms > 1) >= n
GSE66261_filtered = GSE66261[keep, ]
rownames(GSE66261) <- 1:nrow(GSE66261)
The original dataset already came with HUGO gene symbols, so no mapping of symbols was required. In addition, tests showed that nearly 90% of symbols lined up with the symbols obtained from biomaRt.
Tests showed that there were 18 rows in total that consists of mappings to the same symbol. Two rows were updated by replacing their original HUGO gene symbol with the mapped gene symbol, while the rest were dropped to avoid compromising normalization and downstream analysis.
any(GSE66261_filtered$Name == "RNY1P13")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000201900")] <- "RNY1P13"
any(GSE66261_filtered$Name == "STRA6LP")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000254876")] <- "STRA6LP"
remove = duplicated(GSE66261_filtered$Name)
GSE66261_filtered <- GSE66261_filtered[!remove, ]
Trimmed means of M-value (TMM) was chosen as the normalization technique as it a specialized normalization technique for RNASeq datasets.
filtered_data_matrix <- as.matrix(GSE66261_filtered[, 3:14])
rownames(filtered_data_matrix) <- GSE66261_filtered$Name
d = edgeR::DGEList(counts = filtered_data_matrix, group = samples$cell_type)
d = edgeR::calcNormFactors(d)
normalized_counts <- as.data.frame(edgeR::cpm(d))
rownames(normalized_counts) <- GSE66261_filtered$Name
Figure 1: We observe that TH1 and TH2 samples cluster all together as they are both involved in immune response, while TH17 samples cluster together separately from TH1 and TH2.
Figure 2: The boxplot shows that there aren’t significant differences in the mean for each sample after normalization.
Figure 3: The soft bimodal distribution is something to keep in mind when performing differential expression analysis.
Figure 4: The BCV plot confirms the trend of more gene expression leading to dispersion values being closer together, and less variation overall.
normalized_counts
Before running differential expression analysis, we have to create a linear model in R by creating a design matrix. We will account for both library variability and the condition of interest, which is contributing to the differential expression. This is demonstrated by the MDS plot shown in Figure 1, where samples are clustering according to condition.
model_design <- model.matrix(~ samples$library + samples$condition)
Fit the normalized counts to the model design created.
expression_mat <- as.matrix(normalized_counts)
minimal_set <- Biobase::ExpressionSet(assayData = expression_mat)
fit <- limma::lmFit(minimal_set, model_design)
Then, we use empirical Bayes to compute differential expression.
fit2 <- limma::eBayes(fit, trend = TRUE)
(Answer to Q2) The Benjamini Hochberg method was used as the paper associated with the experiment specified that false discovery rate was used to correct for multiple testing.
output_hits <- limma::topTable(fit2, coef = ncol(model_design), adjust.method = "BH", number = nrow(expression_mat))
# Then, we sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
(Answer to Q1) 2376 genes are differentially expressed pass the threshold p-value of 0.05. 0.05 was chosen as it is the most commonly used cutoff for significance, which signifies a 95% chance that differential expression would not be observed if the condition had no effect.
length(which(output_hits$P.Value < 0.05))
## [1] 2376
(Answer to Q2) On the other hand, 504 genes pass correction.
length(which(output_hits$adj.P.Val < 0.05))
## [1] 504
Figure 5: Differentially expressed genes are coloured and bolded in red.
Figure 6: We can see that the MA plot shows a high degree of confidence in RAD50 that it is differentially expressed and has high regulation.
(Answer to Q4) Next, we will create a heatmap to visualize upregulation and downregulation of the top hits in the data according to a colour scale. Based on Figure 7, we see that the conditions are clustered together due to the following groupings:
-TA-4 = TH1 and -TA-1 = TH1
-TA-2 = TH2 and -TA-5 = TH2
-TA-6 = TH17 and -TA-3 = TH17
Figure 7: Not only are the conditions clustered together, but their genes show similar patterns of upregulation and downregulation.
(Answer to Q1) Since we are focused on threshold calculation rather than rank calculation, options included DAVID, EnrichR, and g:Profiler. g:Profiler was used due to familiarity and because it is a popular tool for functional enrichment analysis and visualization of gene lists. g:Profiler was used through the gprofiler2 R package rather than through their web browser.
I will start by creating a thresholded list of genes, by calculating their rank and then retrieving upregulated and downregulated genes that are significant (pval < 0.05).
output_hits[, "rank"] <- -log(output_hits$P.Value, base = 10) * sign(output_hits$logFC)
output_hits <- output_hits[order(output_hits$rank), ]
upregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 &
output_hits$logFC > 0)]
downregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 &
output_hits$logFC < 0)]
all_genes <- rownames(output_hits)
To present the results, I will store them in tables.
dir.create(paste0(getwd(), "/data"))
write.table(x = upregulated_genes,
file = file.path("data", "GSE66261_upregulated_genes.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x = downregulated_genes,
file=file.path("data","GSE66261_downregulated_genes.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x = data.frame(genename = rownames(output_hits), F_stat = output_hits$rank),
file = file.path("data", "GSE66261_ranked_genelist.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
There are 1035 upregulated genes that are significant.
length(upregulated_genes)
## [1] 1035
On the other hand, there are 1341 upregulated genes that are significant.
length(downregulated_genes)
## [1] 1341
(Answer to Q2) To maintain consistency with the differential expression analysis, Benjamini Hochberg FDR was used as the correction method for multiple hypothesis testing. Unfortunately, the corresponding paper for GSE66261 did not specify which annotation sources that they used for gene set enrichment.
GO:BP was used as gene ontology is a popular source for identifying relevant biological processes. In addition, KEGG, Reactome (REAC), and Wiki Pathways (WP) were all included as parameters due to a specific interest in learning the biological pathways associated with the genes of interest. Version 0.2.1 of the R package gprofiler2 was used, which is presumed to use the same version as the web browser tool. The web g:Profiler version is e105_eg52_p16_e84549, which was last updated on 03/01/2022.
gost_res <- gprofiler2::gost(all_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
(Answer to Q3) A p-value threshold of 0.05 was used as previously done for differential expression analysis. 3862 genesets were returned in total, with nearly 75% (2861 out of 3862) coming from GO:BP.
nrow(gost_res$result)
## [1] 3862
table(gost_res$result$source)
##
## GO:BP KEGG REAC WP
## 2861 152 658 191
(Answer to Q3) If the p-value threshold is set to a more stringent value of 0.01, 3026 genesets are returned instead.
gost_res_stringent <- gprofiler2::gost(all_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.01,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
nrow(gost_res_stringent$result)
## [1] 3026
table(gost_res_stringent$result$source)
##
## GO:BP KEGG REAC WP
## 2208 122 559 137
Observing the plot function that gprofiler2 provides as shown in Figure 8, the term names whose -log10(p-adj) value is greater than 16 are the most significant and the ones that we are most interested in.
Figure 8: Despite the visualization, it is hard to gauge the number of term names that we are most interested in in due to the capping of values whose -log10(p-adj) > 16.
(Answer to Q4) Next, we will run the g:Profiler analysis against the upregulated and downregulated genes both, with a threshold of 0.05.
gost_res_up <- gprofiler2::gost(upregulated_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
gost_res_down <- gprofiler2::gost(downregulated_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
(Answer to Q4) Using all genes, we see that the term names that are most enriched are involved in metabolic and biosynthetic processes.
knitr::kable(gost_res$result$term_name[1:10])
| x |
|---|
| cellular macromolecule metabolic process |
| organic substance biosynthetic process |
| organonitrogen compound metabolic process |
| biosynthetic process |
| cellular biosynthetic process |
| cellular protein metabolic process |
| regulation of cellular metabolic process |
| protein metabolic process |
| macromolecule biosynthetic process |
| regulation of primary metabolic process |
(Answer to Q4) Using only upregulated genes that are significant, we see that the term names that are most enriched are still involved in metabolic and biosynthetic processes.
knitr::kable(gost_res_up$result$term_name[1:10])
| x |
|---|
| regulation of cellular process |
| biological regulation |
| small molecule metabolic process |
| biosynthetic process |
| oxoacid metabolic process |
| organic acid metabolic process |
| organic substance biosynthetic process |
| carboxylic acid metabolic process |
| regulation of biological process |
| regulation of cellular metabolic process |
(Answer to Q4) On the other hand, enrichment of downregulated genes that are significant show term names that are involved in immune response and regulation as well as response to stimuli. Based on the results using all genes, this shows that upregulated genes dominate the landscape by contributing more to differential expression and downstream pathways.
knitr::kable(gost_res_down$result$term_name[1:10])
| x |
|---|
| immune system process |
| regulation of response to stimulus |
| response to stimulus |
| regulation of cellular process |
| positive regulation of biological process |
| immune response |
| cellular response to stimulus |
| biological process involved in interspecies interaction between organisms |
| intracellular signal transduction |
| response to stress |
Answer: Yes, they do. The original paper states that long non-coding RNAs (lncRNAs) play essential roles in arrays of cellular processes, which is consistent with the over-representation results of all genes and upregulated genes. The paper also states that differentiation of T helper cells is a critical step to adaptive immune response to pathogens, which is consistent with the over-representation results of downregulated genes.
Answer: Yes, the paper “Long Non-coding RNAs: Major Regulators of Cell Stress in Cancer” by Connerty et al. explains that lncRNAs have diverse roles in cellular processes such as acting as epigenetic modulators and promoting or inhibiting transcription. On the other hand, the textbook “Molecular Biology of the Cell. 4th edition.” by Alberts et al. states that helper T cells are required for almost all adaptive immune responses.
The ranked gene list was created in A2 and stored in the data folder as GSE66261_ranked_genelist.txt. The code below then converts it into a .rnk file.
GSE66261_ranked_genelist <- read.table("data/GSE66261_ranked_genelist.txt", sep = "\t")
colnames(GSE66261_ranked_genelist) <- c("GeneName", "rank")
write.table(GSE66261_ranked_genelist, file = "data/GSE66261_ranked_genelist.rnk", sep = "\t", row.names = FALSE, quote = FALSE)
I used GSEA version 4.2.3 and ran the GSEA preranked analysis. The geneset used was the GO biological pathways file curated by the Bader Lab, and downloaded using the code below. The geneset is dated April 1, 2022.
data_dir <- paste0(getwd(), "/")
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = RCurl::getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred
# from electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(data_dir, gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
When running GSEA, the following parameters were used:
The positive phenotype in this study are genes that are upregulated when exposed to TH1, TH2, and TH17 conditions, while the negative phenotype represents genes that are downregulated when exposed to the same conditions.
With a nominal pvalue < 5%, 76 gene sets were significantly upregulated in the positive phenotype as shown in Figure 1. Top gene sets enriched for terms such as metabolic processes, cell differentiation, and development. The top gene of the most enriched gene set is TP53I3, whose translated protein is involved in cellular responses to oxidative stresses according to Gene Cards.
On the other hand, using the same pvalue cutoff, 1070 gene sets were significantly upregulated in the negative phenotype as shown in Figure 1. Top gene sets enriched primarily for immune response and response to viruses. The top gene of the most enriched gene set is IDO1, whose translated protein plays a role in antimicrobial and antitumor defense according to Gene Cards.
Figure 1: GSEA results
The thresholded analysis using g:Profiler on upregulated genes shows term names that are specific to metabolic and biosynthetic processes. Likewise, the top gene set in the positive phenotype for GSEA is related to metabolic process. However, the GSEA results also showed term names such as different types of development, morphogenesis, and cell differentiation as shown in Figure 2. This indicates that the results as a whole aren’t similar.
Figure 2: GSEA - Gene set term names for positive phenotype
On the other hand, the thresholded analysis using g:Profiler on downregulated genes shows term names such as immune response and regulation as well as response to stimuli. Gene sets returned by GSEA for the negative phenotype are also enriched for terms related to immune response. However, the response types are more diverse such as interferon gamma and TNFA signalling as shown in Figure 3.
Figure 3: GSEA - Gene set term names for negative phenotype
Despite some similarities between the g:Profiler results and GSEA results, this is not a straightforward comparison. This is because the thresholded analysis only looked at a subset of genes that were significant, while the non-thresholded analysis used all genes. This is evident in the terminology returned by GSEA, with a more diverse and unique set of term names to reflect all the genes.
Cytoscape 3.9.1 was used with EnrichmentMap installed through the App Manager.
An enrichment map was created by scanning the GSEA output folder for enrichment data. Through the “Analyze Network” tool, it is revealed that there are 651 nodes and 1639 edges as shown in Figure 4.
When creating the map, thresholds included a node cutoff of pval = 0.05, and an edge cutoff of pval = 0.375. As shown in Figure 5, all nodes are coloured blue, which indicates that only downregulated genes passed the threshold when mapping.
Figure 5: Initial network after enrichment mapping
The network was annotated using the AutoAnnotate tool available in Cytoscape. The following parameters were used:
Figure 6: A publication ready figure of the non-annotated network
Figure 7: A publication ready figure of the annotated network
Figure 8: Corresponding legend of both networks
The network was collapsed to a theme network using AutoAnnotate’s coSE Cluster Layout, as shown in Figure 9. Major themes are represented by large clusters of gene sets/nodes with a high number of connections/edges, with two being particularly relevant:
Since all gene sets seem to contain downregulated genes only, both themes ultimately line up with the model and GSEA results as they are both implicated in immune response. However, the annotations reveal the specific themes that result in the immune response and allow for further investigation of their pathways.
Figure 9: Collapsed theme network
The enrichment results do support the conclusions of the original paper. Specifically, the “Molecular immuno adaptive” and “cd4 alpha differentiation” annotations highlight the differentiation of naive T helper cells (also known as CD4+ T cells) into effector T helper subsets through TH1, TH2, and TH17 polarizing conditions. The paper mentions that this is a critical step for adaptive immune response to pathogens.
The enrichment results differ from the A2 results as the GSEA results and subsequent annotation provide more information than simply immune response, which is too general and broad. Rather, the second set of annotations “Cascade initiated toll” and “dectin downstream kb” specify the signalling pathways at play during the adaptive immune response.
Yes, the paper “Toll-like receptor 4 signaling in T cells promotes autoimmune inflammation” by Reynolds et al. specifies that TLR4 is expressed in CD4+ T cells. In addition, the paper “Dectin-1 Signaling Update: New Perspectives for Trained Immunity” by Mata-Martinez et al. states that Dectin-1 signalling ignites the differentiation of naive CD4+ T cells to a TH1 or TH17 phenotype.
A post analysis to the network was done to analyze 4 transcription factors which were determined to be key for governing cell fate decisions according to the paper. Specifically, they play a role in the selective activation and silencing of genes encoding lineage specific cytokines. The transcription factors are:
The MSigDB human transcription factors file from download.baderlab.org (Human_TranscriptionFactors_MSigdb_April_01_2022_symbol.gmt) was initially used. It was then filtered to only include gene sets containing one of the 4 transcription factors mentioned. This code is provided in the journal entry associated with this assignment. The filtered signature gene set contains 99 gene sets in total, which was fed into Enrichment Map. A Mann-Whitley two-sided test was implemented for the post analysis. A cutoff of 0.0001 was used to limit the number of gene sets analyzed and to improve runtime. This led to 10 gene sets being fed into the analysis, as shown in Figure 10.
Figure 10: Signature gene sets selected for post analysis
Looking at a zoomed image of the annotated enrichment map after post analysis, we see that TGGAAA_NFAT_Q4_01 is highly associated with the two main clusters of molecular immuno adaptive and cascade initiated toll. Upon further search, the paper “NFAT, immunity and cancer: a transcription factor comes of age” by Martin R. Müller & Anjana Rao reveals that the class of NFAT transcription factors are of primary importance during T cell activation and differentiation. However, the paper fails to mention the expression of NFAT when exposed to TH1, TH2, and TH17 conditions. Thus, there should be a follow up with the authors of the paper by revealing the results of the post analysis and asking whether NFAT was examined in detail.
Figure 11: Enrichment map with annotations after the post analysis
Figure 12: Zoomed in enrichment map with annotations after the post analysis